FEATURE SELECTION FOR BEST MEAN SQUARE APPROXIMATION OF CLASS DENSITIES


ABSTRACT. A criterion for linear feature selection is proposed which is
based on mean square approximation of class density functions. It is
shown that for the widest possible class of approximants, the criterion
reduces to Devijver's Bayesian distance. For linear approximants the
criterion is equivalent to well known generalized Fisher criteria.

Pattern recognition Feature selection 

Discriminant analysis Pattern class separability 



Feature Selection for Best Mean Square
Approximation of Class Densities 

1. Introduction 

The purpose of this note is to describe a general mean square approach
to linear feature selection which connects certain generalized Fisher
criteria in discriminant analysis with a measure of pattern class
separation introduced by Devijver (3). The former are typical of those
criteria which utilize only low order information about the pattern class
distributions, while the latter requires that the class distributions be
known, or at least accurately estimated.

Let $X$ denote a random vector in real n-space $R^n$ which arises from one
of $m$ pattern classes $\Pi_1, \ldots, \Pi_m$ having known prior
probabilities $\alpha_1, \ldots, \alpha_m$, where $\alpha_i > 0$ and
$\sum_{i=1}^m \alpha_i = 1$. Let $F_j(x)$ denote the $j$-th class
conditional distribution function of $X$ and let
$F(x) = \sum_{i=1}^m \alpha_i F_i(x)$ denote the mixture distribution.
For a given measurable transformation $T: R^n \to R^k$ let $G_j(y, T)$
and $G(y, T)$ denote, respectively, the $j$-th class conditional
distribution and mixture distribution of the random variable $Y = TX$. We
let $f_j(x)$ (resp. $g_j(y, T)$) denote the class conditional densities of
$X$ (resp. $Y$) with respect to their corresponding mixture distributions;
i.e.,

$$f_j = \frac{dF_j}{dF} \quad \text{and} \quad g_j(\cdot, T) = \frac{dG_j(\cdot, T)}{dG(\cdot, T)}.$$


We will restrict our attention to the set of linear transformations $T$ of
rank $k$, and assume that each pattern class $\Pi_i$ has a mean $\mu_i$
and positive definite covariance matrix $\Sigma_i$. Let

$$\mu = \sum_{i=1}^m \alpha_i \mu_i$$

be the mixture mean, and let

$$S_B = \sum_{i=1}^m \alpha_i (\mu_i - \mu)(\mu_i - \mu)^T, \qquad S_W = \sum_{i=1}^m \alpha_i \Sigma_i, \qquad S = S_W + S_B$$

denote the between class scatter matrix, the average within class scatter
matrix, and the total scatter matrix respectively.
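The definitions above translate directly into a few lines of array code. The following is only a minimal sketch, assuming NumPy is available; the function name `scatter_matrices` and the numerical values of the priors, means, and covariances are hypothetical and chosen purely for illustration.

```python
import numpy as np

def scatter_matrices(priors, means, covs):
    """Return (mu, S_B, S_W, S) for given class priors, means and covariances."""
    a = np.asarray(priors, dtype=float)   # shape (m,), the alpha_i
    M = np.asarray(means, dtype=float)    # shape (m, n), row i is mu_i
    C = np.asarray(covs, dtype=float)     # shape (m, n, n), Sigma_i
    mu = a @ M                            # mixture mean
    D = M - mu                            # rows are mu_i - mu
    S_B = (a[:, None] * D).T @ D          # sum_i alpha_i (mu_i - mu)(mu_i - mu)^T
    S_W = np.tensordot(a, C, axes=1)      # sum_i alpha_i Sigma_i
    return mu, S_B, S_W, S_W + S_B

# Hypothetical three-class example in the plane.
priors = [0.5, 0.3, 0.2]
means = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]],
        [[2.0, 0.0], [0.0, 2.0]],
        [[1.0, 0.5], [0.5, 1.0]]]
mu, S_B, S_W, S = scatter_matrices(priors, means, covs)

# Cross-check: the total scatter should equal E[X X^T] - mu mu^T for the mixture.
W = sum(a * (np.array(C) + np.outer(m, m)) for a, m, C in zip(priors, means, covs))
print(np.allclose(S, W - np.outer(mu, mu)))
```

The cross-check uses the fact that the second moment matrix of the mixture is the prior-weighted sum of the class second moments.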

A number of interesting feature selection criteria can be formulated
using only the parameters $\mu$, $S$, $S_W$, $S_B$; e.g., the criteria
proposed by Kittler and Young, Foley and Sammon, Fukunaga and Koontz,
and the discrete analogue of the modified Karhunen-Loeve expansion of
Chien and Fu. The modified K-L expansion minimizes an entropy function,
and also best represents the pattern vector $X$ in an overall least
squares sense; however, its value for discrimination has been questioned
by several authors (see Kittler). Fukunaga considers several criteria of
the generalized Fisher type, including

$$J_1(T) = \mathrm{tr}\,\big[(T S_W T^T)^{-1} (T S_B T^T)\big].$$

Thus, according to this criterion, the best $k \times n$ matrix $T$ of
rank $k$ is one which maximizes $J_1(T)$. The solution is any $T$ which is
row equivalent to a $k \times n$ matrix whose rows are linearly
independent principal eigenvectors (i.e., corresponding to the largest
eigenvalues) of $S_W^{-1} S_B$. We also consider a modification,

$$\mathrm{tr}\,\big[(T S T^T)^{-1} (T S_B T^T)\big],$$

which admits the same maximizing $T$.
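As an illustration of the eigenvector solution, the sketch below computes a maximizing $T$ for $J_1$ and checks that the criterion is unchanged when $T$ is replaced by $QT$ with $Q$ nonsingular. It assumes NumPy; the helper names `J1` and `principal_rows`, the whitening by a Cholesky factor of $S_W$, and all numerical values are choices made here for illustration and are not taken from the text.

```python
import numpy as np

def J1(T, S_W, S_B):
    """tr[(T S_W T^T)^{-1} (T S_B T^T)] for a k x n matrix T of rank k."""
    return float(np.trace(np.linalg.solve(T @ S_W @ T.T, T @ S_B @ T.T)))

def principal_rows(S_W, S_B, k):
    """k x n matrix whose rows are principal eigenvectors of S_W^{-1} S_B."""
    L = np.linalg.cholesky(S_W)                    # S_W = L L^T
    Linv = np.linalg.inv(L)
    lam, U = np.linalg.eigh(Linv @ S_B @ Linv.T)   # symmetric problem, same eigenvalues
    top = np.argsort(lam)[::-1][:k]                # indices of the k largest eigenvalues
    return (Linv.T @ U[:, top]).T, lam[top]        # rows t satisfy S_B t = lam S_W t

# Hypothetical example with n = 3 and m = 3 classes.
S_W = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
alpha = np.array([0.5, 0.3, 0.2])
M = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 1.0], [0.0, 3.0, -1.0]])
D = M - alpha @ M
S_B = (alpha[:, None] * D).T @ D

T_opt, lam = principal_rows(S_W, S_B, k=2)
Q = np.array([[2.0, 1.0], [0.0, 3.0]])                   # any nonsingular k x k matrix
T_other = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # an arbitrary competitor

print(J1(T_opt, S_W, S_B), lam.sum())   # criterion value vs. sum of top eigenvalues
print(J1(Q @ T_opt, S_W, S_B))          # unchanged under row equivalence
print(J1(T_other, S_W, S_B))            # no larger than the optimum
```

With this normalization the rows of `T_opt` satisfy $T S_W T^T = I$, so the criterion value is the sum of the $k$ largest eigenvalues.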

The Bayesian distance corresponding to the pattern classes
$\Pi_1, \ldots, \Pi_m$, as defined by Devijver (3), is

$$B_n = \sum_{i=1}^m \alpha_i^2 \, E\big[f_i(X)^2\big] = \sum_{i=1}^m \alpha_i^2 \int_{R^n} f_i(x)^2 \, dF(x).$$

Its transformed value is

$$B_k(T) = \sum_{i=1}^m \alpha_i^2 \int_{R^k} g_i(y, T)^2 \, dG(y, T).$$

Devijver proves a number of interesting inequalities relating $B_n$ to the
Bayes probability of misclassification, the Bhattacharyya coefficient, and
other measures of class separation. In addition, he notes that Cover and
Hart (2) have shown that $1 - B_n$ is the asymptotic error rate of the
nearest neighbor classifier.
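For intuition, the Bayesian distance can be evaluated exactly when the observation takes finitely many values, since the integrals become sums and $\alpha_i f_i(x)$ is the posterior probability of class $i$. The following is a minimal sketch in plain Python under that discrete assumption; the function names, the probability tables, and the map `x // 2` standing in for a transformation are hypothetical.

```python
from collections import defaultdict

def bayesian_distance(priors, pmfs):
    """Sum over x of sum_i (alpha_i p_i(x))^2 / p(x), with p the mixture pmf."""
    total = 0.0
    for x in set().union(*pmfs):
        joint = [a * p.get(x, 0.0) for a, p in zip(priors, pmfs)]
        mix = sum(joint)                      # mixture probability of x
        if mix > 0:
            total += sum(j * j for j in joint) / mix
    return total

def push_forward(pmf, T):
    """Distribution of T(X) when X has the given pmf."""
    out = defaultdict(float)
    for x, p in pmf.items():
        out[T(x)] += p
    return dict(out)

# Hypothetical two-class example on {0, 1, 2, 3}.
priors = [0.5, 0.5]
pmfs = [{0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1},
        {0: 0.1, 1: 0.1, 2: 0.1, 3: 0.7}]
T = lambda x: x // 2                          # merges {0, 1} and {2, 3}

B_n = bayesian_distance(priors, pmfs)
B_k = bayesian_distance(priors, [push_forward(p, T) for p in pmfs])
print(B_n, B_k, 1.0 - B_n)                    # 1 - B_n: asymptotic nearest neighbor error
```

In this example the transformed value is smaller than the original one, reflecting the information lost by merging points.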

2. Mean Square Optimality of Bayesian Distance

For a given $k \times n$ matrix $T$ of rank $k$, let $L_2(T)$ denote the
set of all measurable functions $\varphi: R^k \to R^1$ such that
$\int_{R^k} \varphi(y)^2 \, dG(y, T) < \infty$ and let $\mathcal{C}_T$ be
a given closed linear subspace of $L_2(T)$. Our general approach to
linear feature selection is to choose that $T$, if possible, which
minimizes

$$R(T) = \sum_{i=1}^m \beta_i \min_{\varphi_i \in \mathcal{C}_T} \int_{R^n} \big[\varphi_i(Tx) - f_i(x)\big]^2 \, dF(x),$$

where the $\beta_i$ are positive weights. That is, we attempt to find a
$T$ which produces a set of approximations $\varphi_i(Tx)$ to the class
densities $f_i(x)$ which is best in an overall mean square sense. Given
such approximations we may classify observations of $X$ according to the
pseudo-Bayes rule: decide that $X$ is from class $\Pi_i$ if
$\alpha_i \varphi_i(TX) \ge \alpha_j \varphi_j(TX)$ for each $j \ne i$.
Since we are interested in classification accuracy, it seems appropriate
to choose weights $\beta_i$ which reflect the relative importance of the
classes in the mixture distribution; e.g., $\beta_i = \alpha_i$ for all
$i$ or $\beta_i = \alpha_i^2$ for all $i$. For the remainder of this
section we choose $\beta_i = \alpha_i^2$ and $\mathcal{C}_T = L_2(T)$.

Proposition 1: For $\beta_i = \alpha_i^2$, $i = 1, \ldots, m$, and
$\mathcal{C}_T = L_2(T)$ for each $T$,

$$R(T) = B_n - B_k(T).$$

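Proof: Observe that $g_i(y, T) \in L_2(T)$, since it is bounded by
$1/\alpha_i$. Moreover, for each $\varphi \in L_2(T)$,

$$\int_{R^n} \varphi(Tx)\big[g_i(Tx, T) - f_i(x)\big] \, dF(x)$$

$$= \int_{R^n} \varphi(Tx) \, g_i(Tx, T) \, dF(x) - \int_{R^n} \varphi(Tx) \, dF_i(x)$$

$$= \int_{R^k} \varphi(y) \, g_i(y, T) \, dG(y, T) - \int_{R^k} \varphi(y) \, dG_i(y, T) = 0.$$

Thus $g_i(Tx, T) - f_i(x)$ is orthogonal to every function of the form
$\varphi(Tx)$ with $\varphi \in L_2(T)$. Therefore,

$$\min_{\varphi \in L_2(T)} \int_{R^n} \big[\varphi(Tx) - f_i(x)\big]^2 \, dF(x) = \int_{R^n} \big[g_i(Tx, T) - f_i(x)\big]^2 \, dF(x)$$

$$= \int_{R^n} f_i(x)^2 \, dF(x) - \int_{R^k} g_i(y, T)^2 \, dG(y, T).$$

The assertion of the proposition follows on multiplying by $\alpha_i^2$
and summing over $i$.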
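The identity of Proposition 1 can be checked numerically in a finite setting, where the integrals are sums and $g_i$ is obtained by pooling probabilities over the points that $T$ maps together. This is only a sanity check under that discrete assumption, written in plain Python; the function name `prop1_check`, the three probability tables, and the map `x // 2` are hypothetical.

```python
def prop1_check(priors, pmfs, T):
    """Return (R, B_n, B_k) with beta_i = alpha_i^2 and phi_i = g_i on a finite space."""
    xs = sorted(set().union(*pmfs))
    mix = {x: sum(a * p.get(x, 0.0) for a, p in zip(priors, pmfs)) for x in xs}
    G = {}                                   # mixture distribution of Y = T(X)
    Gi = [dict() for _ in pmfs]              # class conditional distributions of Y
    for x in xs:
        y = T(x)
        G[y] = G.get(y, 0.0) + mix[x]
        for i, p in enumerate(pmfs):
            Gi[i][y] = Gi[i].get(y, 0.0) + p.get(x, 0.0)
    R = B_n = B_k = 0.0
    for i, (a, p) in enumerate(zip(priors, pmfs)):
        for x in xs:
            if mix[x] == 0:
                continue
            f = p.get(x, 0.0) / mix[x]       # f_i(x) = dF_i/dF
            g = Gi[i][T(x)] / G[T(x)]        # g_i(Tx, T) = dG_i/dG
            R += a * a * (g - f) ** 2 * mix[x]
            B_n += a * a * f * f * mix[x]
        for y, Gy in G.items():
            if Gy > 0:
                B_k += a * a * (Gi[i][y] / Gy) ** 2 * Gy
    return R, B_n, B_k

# Hypothetical three-class example on {0, ..., 5}.
priors = [0.5, 0.3, 0.2]
pmfs = [{0: 0.40, 1: 0.30, 2: 0.10, 3: 0.10, 4: 0.05, 5: 0.05},
        {0: 0.05, 1: 0.05, 2: 0.40, 3: 0.30, 4: 0.10, 5: 0.10},
        {0: 0.10, 1: 0.10, 2: 0.05, 3: 0.05, 4: 0.30, 5: 0.40}]
R, B_n, B_k = prop1_check(priors, pmfs, lambda x: x // 2)
print(R, B_n - B_k)                          # the two values should agree
```

The map used here is not linear, which is consistent with the proof, since the argument uses only measurability of $T$.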

We may summarize by saying that if there exists a $k \times n$ matrix
$T_0$ of rank $k$ which maximizes $B_k(T)$, then the functions
$g_1(T_0 x, T_0), \ldots, g_m(T_0 x, T_0)$ constitute the best mean square
approximation to the class densities $f_1(x), \ldots, f_m(x)$ attainable
through a linear compression of the data into $k$ dimensions. Since
$B_k(QT) = B_k(T)$ for each nonsingular $k \times k$ matrix $Q$ and each
$k \times n$ matrix $T$ of rank $k$, $B_k$ has a maximum if and only if it
has a maximum on the compact set $\{T \mid TT^T = I\}$. In particular, if
$B_k$ is continuous, it has a maximum.

3. Best Linear Approximation of Class Densities

In this section we let $\mathcal{C}_T$ be the set of functions
$\varphi(y) = w + b^T y$, where $w$ is a real number and $b \in R^k$. For
simplicity, we use the notation $\varphi(y) = a^T v(y)$, where
$a = \binom{w}{b} \in R^{k+1}$ and $v(y) = \binom{1}{y} \in R^{k+1}$. For
given $T$, $a_i = \binom{w_i}{b_i}$ minimizes

$$\int_{R^n} \big[a^T v(Tx) - f_i(x)\big]^2 \, dF(x) = a^T \Big[\int_{R^n} v(Tx) v(Tx)^T \, dF(x)\Big] a - 2 a^T \int_{R^n} v(Tx) \, dF_i(x) + \int_{R^n} f_i(x)^2 \, dF(x)$$

if and only if

$$\begin{pmatrix} 1 & \mu^T T^T \\ T\mu & T W T^T \end{pmatrix} \begin{pmatrix} w_i \\ b_i \end{pmatrix} = \begin{pmatrix} 1 \\ T\mu_i \end{pmatrix},$$

where $W = E[XX^T] = S + \mu\mu^T$. Solving this system gives

$$w_i = 1 - \mu^T T^T (T S T^T)^{-1} T(\mu_i - \mu),$$

$$b_i = (T S T^T)^{-1} T(\mu_i - \mu).$$

The corresponding squared error of approximation is

$$-1 - (\mu_i - \mu)^T T^T (T S T^T)^{-1} T(\mu_i - \mu) + \int_{R^n} f_i(x)^2 \, dF(x).$$
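The closed-form coefficients can be compared against a direct numerical solution of the normal equations. The sketch below assumes NumPy; the function names `solve_normal_equations` and `closed_form` and all numerical inputs are hypothetical, and only the moments $\mu$, $S$, and $\mu_i$ are needed because the class density enters through $\int v(Tx)\,dF_i(x)$ alone.

```python
import numpy as np

def solve_normal_equations(T, mu, S, mu_i):
    """Solve E[v v^T] a = E_i[v] for a = (w, b), with v(y) = (1, y) and y = T x."""
    k = T.shape[0]
    W = S + np.outer(mu, mu)                 # W = E[X X^T]
    A = np.empty((k + 1, k + 1))
    A[0, 0] = 1.0
    A[0, 1:] = T @ mu
    A[1:, 0] = T @ mu
    A[1:, 1:] = T @ W @ T.T
    rhs = np.concatenate(([1.0], T @ mu_i))  # integral of v(Tx) dF_i(x)
    a = np.linalg.solve(A, rhs)
    return a[0], a[1:], float(a @ rhs)       # w_i, b_i, and the error reduction a^T rhs

def closed_form(T, mu, S, mu_i):
    """Closed-form w_i, b_i and error reduction from the text."""
    d = T @ (mu_i - mu)
    b = np.linalg.solve(T @ S @ T.T, d)
    w = 1.0 - b @ (T @ mu)
    return w, b, float(1.0 + d @ b)

# Hypothetical inputs with n = 3 and k = 2.
T = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]])
mu = np.array([0.6, 0.9, 0.1])
S = np.array([[2.5, 0.3, 0.1], [0.3, 2.0, 0.2], [0.1, 0.2, 1.8]])
mu_i = np.array([2.0, 0.0, 1.0])

w1, b1, red1 = solve_normal_equations(T, mu, S, mu_i)
w2, b2, red2 = closed_form(T, mu, S, mu_i)
print(w1, w2)
print(b1, b2)
```

The third returned value is the amount subtracted from $\int f_i(x)^2\,dF(x)$ to obtain the squared error, namely $1 + (\mu_i - \mu)^T T^T (TST^T)^{-1} T(\mu_i - \mu)$.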




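Therefore, the criterion to be minimized is

$$R(T) = -\sum_{i=1}^m \beta_i (\mu_i - \mu)^T T^T (T S T^T)^{-1} T(\mu_i - \mu) + \text{terms independent of } T.$$

That is, we want to maximize

$$\mathrm{trace}\,\big[(T S T^T)^{-1} \, T S_\beta T^T\big],$$

where

$$S_\beta = \sum_{i=1}^m \beta_i (\mu_i - \mu)(\mu_i - \mu)^T.$$

The solution is $T = Q T_0$, where $T_0$ is a $k \times n$ matrix whose
rows are linearly independent principal eigenvectors of $S^{-1} S_\beta$
and $Q$ is an arbitrary nonsingular $k \times k$ matrix. In particular,
for $\beta_i = \alpha_i$ we have $S_\beta = S_B$ and obtain the same
solution given by Fukunaga's criterion,

$$\mathrm{trace}\,\big[(T S_W T^T)^{-1} (T S_B T^T)\big].$$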
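The equivalence with the Fisher criterion for $\beta_i = \alpha_i$ can be illustrated by comparing row spaces, since the solution is determined only up to a nonsingular factor $Q$. The following is a rough sketch assuming NumPy; the helper names, the three-class example, and the comparison via orthogonal projectors are choices made here for illustration.

```python
import numpy as np

def principal_rows(A, B, k):
    """Rows are k principal eigenvectors of A^{-1} B (A positive definite, B symmetric)."""
    Linv = np.linalg.inv(np.linalg.cholesky(A))
    lam, U = np.linalg.eigh(Linv @ B @ Linv.T)
    top = np.argsort(lam)[::-1][:k]
    return (Linv.T @ U[:, top]).T

def weighted_between_scatter(beta, alpha, M):
    """S_beta = sum_i beta_i (mu_i - mu)(mu_i - mu)^T, with mu the mixture mean."""
    D = M - alpha @ M
    return (beta[:, None] * D).T @ D

def criterion(T, S, S_beta):
    """trace[(T S T^T)^{-1} T S_beta T^T]."""
    return float(np.trace(np.linalg.solve(T @ S @ T.T, T @ S_beta @ T.T)))

def projector(T):
    """Orthogonal projector onto the row space of T (invariant under T -> Q T)."""
    return T.T @ np.linalg.solve(T @ T.T, T)

# Hypothetical example with n = 3 and m = 3 classes.
alpha = np.array([0.5, 0.3, 0.2])
M = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 1.0], [0.0, 3.0, -1.0]])
S_W = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
S_B = weighted_between_scatter(alpha, alpha, M)
S = S_W + S_B

# beta_i = alpha_i: same row space as the generalized Fisher solution.
T_lin = principal_rows(S, S_B, k=2)
T_fisher = principal_rows(S_W, S_B, k=2)
print(np.allclose(projector(T_lin), projector(T_fisher)))

# beta_i = alpha_i^2: the one-dimensional solution is optimal for its own criterion.
S_sq = weighted_between_scatter(alpha ** 2, alpha, M)
T_sq = principal_rows(S, S_sq, k=1)
print(criterion(T_sq, S, S_sq), criterion(principal_rows(S, S_B, k=1), S, S_sq))
```

The row spaces coincide because $S_B t = \lambda S_W t$ implies $S^{-1} S_B t = \frac{\lambda}{1+\lambda}\, t$, and the map $\lambda \mapsto \lambda/(1+\lambda)$ preserves the ordering of the eigenvalues.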


4. Concluding Remarks 

An equivalent set of criteria for feature selection are expressions
such as

$$\bar{R}(T) = \sum_{i=1}^m \beta_i \min_{\varphi \in \mathcal{C}_T} \int_{R^n} \big[\varphi(Tx) - \alpha_i f_i(x)\big]^2 \, dF(x)$$

in which the posterior probabilities $\alpha_i f_i(x)$ of the classes are
approximated. If each $\beta_i$ is chosen to be 1, $\bar{R}(T)$ is the
same as $R(T)$ with $\beta_i = \alpha_i^2$, because $\mathcal{C}_T$ is a
linear subspace and scaling the approximant by $\alpha_i$ scales the
minimum squared error by $\alpha_i^2$. This, together with Proposition 1
and the relationship between Bayesian distance and the probability of
error, seems to indicate that the choice $\beta_i = \alpha_i^2$ is a good
one.

In some cases it may be numerically feasible to use $B_k(T)$ as a feature
selection criterion when assumptions about the parametric form of the
class distributions are made. For example, if each class distribution is
multivariate normal, then $B_k(T)$ reduces to an expression which is
continuously differentiable in $T$ and which, moreover, can be
approximated by sample averages over an unlabeled sample from the mixture
distribution. Thus descent algorithms might be successfully employed in
maximizing $B_k(T)$.
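A sample-average approximation of this kind might look as follows. This is a sketch under the stated normality assumption, using NumPy: since $\alpha_i g_i(y, T)$ is the posterior probability of class $i$ given $Y = y$, $B_k(T)$ is the mixture expectation of the sum of squared posteriors, which can be averaged over unlabeled observations. The function names, the two-class example, and the sample size are hypothetical, and no descent step is shown.

```python
import numpy as np

def gauss_logpdf(Y, mean, cov):
    """Log density of N(mean, cov) at each row of Y."""
    d = Y - mean
    sol = np.linalg.solve(cov, d.T).T
    maha = np.sum(d * sol, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + mean.size * np.log(2.0 * np.pi))

def estimate_Bk(T, X, priors, means, covs):
    """Average of sum_i posterior_i(T x)^2 over the rows of the unlabeled sample X."""
    Y = X @ T.T
    logj = np.stack([np.log(a) + gauss_logpdf(Y, T @ m, T @ C @ T.T)
                     for a, m, C in zip(priors, means, covs)], axis=1)
    logj -= logj.max(axis=1, keepdims=True)   # stabilize before exponentiating
    post = np.exp(logj)
    post /= post.sum(axis=1, keepdims=True)   # posteriors alpha_i g_i(y, T)
    return float(np.mean(np.sum(post ** 2, axis=1)))

# Hypothetical two-class normal mixture in the plane, separated along the first axis.
rng = np.random.default_rng(0)
priors = [0.5, 0.5]
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
covs = [np.eye(2), np.eye(2)]
labels = rng.choice(2, size=2000, p=priors)   # labels are used only to simulate X
X = means[labels] + rng.standard_normal((2000, 2))

T_good = np.array([[1.0, 0.0]])               # keeps the discriminating coordinate
T_bad = np.array([[0.0, 1.0]])                # keeps the uninformative coordinate
B_good = estimate_Bk(T_good, X, priors, means, covs)
B_bad = estimate_Bk(T_bad, X, priors, means, covs)
print(B_good, B_bad)
```

Because the estimate is a smooth function of the entries of `T`, it could in principle be passed to a gradient-based optimizer, restricted for instance to matrices with $TT^T = I$.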

Finally, we remark that there is no reason in principle why $B_k(T)$
cannot be regarded as a criterion for nonlinear feature extraction. Indeed,
Proposition 1 remains true when $T$ is any measurable transformation from
$R^n$ onto $R^k$.